Skip to content

feat: add gsd:plan-review-convergence command#2339

Closed
caius-kong wants to merge 6 commits intogsd-build:mainfrom
caius-kong:feat/2306-feat-add-gsd-plan-review-convergence-com
Closed

feat: add gsd:plan-review-convergence command#2339
caius-kong wants to merge 6 commits intogsd-build:mainfrom
caius-kong:feat/2306-feat-add-gsd-plan-review-convergence-com

Conversation

@caius-kong
Copy link
Copy Markdown

@caius-kong caius-kong commented Apr 17, 2026

Linked Issue

Closes #2306


Feature summary

Adds gsd:plan-review-convergence — a cross-AI plan convergence loop that automates the manual plan-phase → review → replan → re-review chain. The orchestrator spawns isolated Agents for existing gsd-plan-phase and gsd-review Skills, then loops until no HIGH concerns remain in REVIEWS.md or max cycles (default 3) are reached.

What changed

New files

File Purpose
commands/gsd/plan-review-convergence.md Command source — installs as gsd-plan-review-convergence skill
get-shit-done/workflows/plan-review-convergence.md Workflow logic: init, convergence loop, stall detection, escalation gate
tests/plan-review-convergence.test.cjs 27 tests covering command structure, loop behavior, stall detection, escalation

Modified files

File What changed
CHANGELOG.md Added [Unreleased] → Added entry for the new command

Implementation notes

Implementation matches the approved spec exactly. The only discretionary decisions:

  • Test file name follows existing <command-name>.test.cjs convention
  • Tests use content assertion pattern (same as chain-flag-plan-phase.test.cjs, code-review-command.test.cjs) — verifies workflow contains key instruction strings the LLM will read at runtime
  • CHANGELOG entry placed under ### Added in [Unreleased] section

Spec compliance

  • New command gsd:plan-review-convergence with all documented flags (--codex, --gemini, --claude, --opencode, --all, --max-cycles N)
  • --codex is the default reviewer when no flag specified
  • Orchestrator-only loop control — planning and reviewing happen in isolated Agents calling existing Skills
  • Initial planning gate: skips if plans exist, spawns gsd-plan-phase if not, errors if no PLAN.md produced
  • Review agent spawned each cycle via gsd-review
  • HIGH concern detection by grepping REVIEWS.md
  • Loop exits when HIGH_COUNT == 0, STATE.md updated
  • Stall detection: warns when HIGH count not decreasing
  • Escalation gate at max cycles: AskUserQuestion with "Proceed anyway" / "Manual review" options
  • TEXT_MODE fallback for plain-text escalation
  • Copilot vscode_askquestions runtime note

Testing

Test coverage

tests/plan-review-convergence.test.cjs — 27 tests across 7 suites:

  • Command source structure (name prefix, flags, allowed-tools, execution_context references)
  • Workflow initialization (gsd-tools.cjs init, --max-cycles parsing, banner)
  • Initial planning gate (has_plans check, agent spawn, error on empty output)
  • Convergence loop (review agent, HIGH grep, loop exit, STATE.md update, --reviews, --skip-research)
  • Stall detection (prev_high_count tracking, warning text)
  • Escalation gate (AskUserQuestion, "Proceed anyway", "Manual review", TEXT_MODE)
  • Artifact verification (REVIEWS.md existence check, error on missing)

All 27 tests pass: node --test tests/plan-review-convergence.test.cjs

Platforms tested

  • macOS
  • Windows (backslash paths — workflow uses bash string ops; no path construction in the new files)
  • Linux

Runtimes tested

  • Claude Code (validated on real phase, 2 cycles, 13 HIGH → 0 HIGH)
  • Gemini CLI
  • OpenCode
  • Codex (used as reviewer in real-world validation)
  • Copilot
  • Other

Scope confirmation

  • The implementation matches the scope approved in the linked issue exactly
  • No additional features, commands, or behaviors were added beyond what was approved
  • Scope did not change during implementation

Checklist

  • Issue linked above with Closes #2306
  • Linked issue has the approved-feature label
  • All acceptance criteria from the issue are met (listed above)
  • Implementation scope matches the approved spec exactly
  • All existing tests pass (npm test)
  • New tests cover the happy path, error cases, and edge cases (27 tests)
  • CHANGELOG.md updated with a user-facing description of the feature
  • Documentation updated — no existing docs reference this command; workflow self-documents via <objective> and <context> tags
  • No unnecessary external dependencies added
  • Works on Windows — no path construction in new files; workflow uses bash variable substitution only

Breaking changes

None

Screenshots / recordings

Real-world validation on Phase 04.1 (COMET Evaluation Metrics, 4 PLAN.md files):

Cycle HIGH count Action
1 13 Stall check passed (13 < ∞) → replan with review feedback
2 0 Converged — STATE.md updated

Concerns caught: verdict math not switched to new metric, missing INCONCLUSIVE outcome paths, silent COMET model failure modes. All resolved in one replan cycle.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced validation and escalation logic for high-priority concerns in the plan review process to ensure strict format compliance and reliable issue tracking.
  • Tests

    • Added comprehensive test coverage for plan review convergence workflow, including initialization, convergence loops, stall detection, and escalation gates.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The changes introduce a contract-based architecture for plan convergence loops. The workflow now enforces review agents to return structured CYCLE_SUMMARY metadata containing current_high counts, which the orchestrator parses via regex instead of counting markdown strings. Accompanying comprehensive test suite validates command structure, initialization, loop control, stall detection, and escalation gates.

Changes

Cohort / File(s) Summary
Workflow Contract System
get-shit-done/workflows/plan-review-convergence.md
Modified review agent to emit machine-readable CYCLE_SUMMARY line with current_high field; orchestrator now parses this contract via regex instead of counting **HIGH strings in markdown. Aborts if contract is unmet.
Test Suite
tests/plan-review-convergence.test.cjs
Added extensive Node test suite (378 lines) validating: command structure and flags, workflow initialization with --max-cycles parsing, convergence loop with agent spawning, CYCLE_SUMMARY contract extraction, stall detection with high-count tracking, and escalation gate behavior with cycle limits and user prompts.

Sequence Diagram

sequenceDiagram
    participant Orchestrator
    participant ReviewAgent as Review Agent
    participant PlanAgent as Plan Agent
    participant User
    
    Orchestrator->>Orchestrator: Init: load max_cycles, phase
    alt No plans exist
        Orchestrator->>PlanAgent: spawn gsd-plan-phase
        PlanAgent-->>Orchestrator: PLAN.md created
    end
    
    loop Each cycle <= max_cycles
        Orchestrator->>ReviewAgent: spawn gsd-review (--codex/etc)
        ReviewAgent->>ReviewAgent: analyze plans
        ReviewAgent-->>Orchestrator: return message + CYCLE_SUMMARY<br/>(current_high=N)
        Orchestrator->>Orchestrator: parse CYCLE_SUMMARY, extract HIGH_COUNT
        
        alt HIGH_COUNT == 0
            Orchestrator->>Orchestrator: converged!<br/>update STATE.md, exit
        else
            Orchestrator->>Orchestrator: check stall:<br/>HIGH_COUNT >= prev_count?
            alt HIGH_COUNT > 0 AND cycle >= max_cycles
                Orchestrator->>User: escalation gate:<br/>proceed/manual review
                User-->>Orchestrator: user choice
            else
                Orchestrator->>PlanAgent: spawn gsd-plan-phase<br/>--reviews --skip-research
                PlanAgent-->>Orchestrator: revised PLAN.md
                Orchestrator->>Orchestrator: next cycle
            end
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • glittercowboy
  • trek-e

Poem

🐰✨ A loop that learns, a rabbit hops with glee,
Plans reviewed by many, now they all agree!
Contract lines ensure the truth is plain,
HIGH concerns shrink with each refrain—
Convergence reached, the phase runs free! 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title 'feat: add gsd:plan-review-convergence command' directly summarizes the primary change: adding a new command.
Description check ✅ Passed Description uses the correct feature template with linked issue, feature summary, detailed implementation notes, spec compliance checklist, and comprehensive testing coverage.
Linked Issues check ✅ Passed Implementation fully addresses #2306 objectives: command with all specified flags (--codex default, --gemini, --claude, --opencode, --all, --max-cycles), orchestrator-only loop control, initial planning gate, review per cycle, HIGH detection via CYCLE_SUMMARY contract, stall detection, escalation gate at max cycles, and real-world validation.
Out of Scope Changes check ✅ Passed All changes directly support the approved feature: new command file, workflow logic, comprehensive test suite, and CHANGELOG entry. No out-of-scope modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commands/gsd/plan-review-convergence.md`:
- Line 4: Update the argument-hint string to include the missing documented
flags so it's aligned with the CLI docs: change the existing argument-hint value
(the "argument-hint" entry in this file) from "<phase> [--codex] [--gemini]
[--all] [--max-cycles N]" to include the additional options (for example
"<phase> [--codex] [--gemini] [--claude] [--opencode] [--text] [--ws] [--all]
[--max-cycles N]"), and apply the same edit to the other argument-hint
occurrences noted elsewhere in the file (the repeated argument-hint blocks
around lines referenced in the comment).

In `@get-shit-done/workflows/plan-review-convergence.md`:
- Line 63: The markdown contains multiple fenced code blocks using ``` without
language tags which trips MD040; update each triple-backtick fence shown in the
diff (and the additional occurrences noted in the review) to include an
appropriate language identifier (for example use ```text for plain output
blocks, ```bash for shell snippets, or ```json for JSON examples) so every
fenced block is annotated and markdownlint MD040 is satisfied.
- Around line 138-140: The HIGH_COUNT assignment can produce "0\n0" because grep
-c exits nonzero when no matches and your || echo "0" runs again; change the
HIGH_COUNT pipeline to avoid emitting a second "0" and ensure a single numeric
value (for example, run grep -c ... 2>/dev/null || true to suppress the non-zero
exit, then normalize with parameter expansion like HIGH_COUNT=${HIGH_COUNT:-0});
update the assignment that defines HIGH_COUNT (and leave HIGH_LINES as-is) so
numeric comparisons against HIGH_COUNT work correctly.
- Around line 20-25: The script calls INIT via the node gsd-tools.cjs init
plan-phase command before extracting PHASE from ARGUMENTS, causing phase
metadata (phase_dir, padded_phase, has_plans, plan_count, etc.) to be
initialized with an empty/stale phase; reorder the logic so you first parse and
validate PHASE and other flags from ARGUMENTS (extract PHASE, ensure it's
non-empty, collect REVIEWER_FLAGS, MAX_CYCLES, GSD_WS, text_mode,
response_language), then run INIT=$(node
"$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" init plan-phase "$PHASE") and
parse its JSON output into phase_dir, phase_number, padded_phase, phase_name,
has_plans, plan_count, commit_docs, text_mode, response_language; add explicit
validation to fail early if PHASE is empty and ensure response_language is
handled after parsing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 06cd5b1f-a32b-4823-a43a-6cef6891d1ff

📥 Commits

Reviewing files that changed from the base of the PR and between 06c528b and d7e6ce9.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • commands/gsd/plan-review-convergence.md
  • get-shit-done/workflows/plan-review-convergence.md
  • tests/plan-review-convergence.test.cjs

Comment thread commands/gsd/plan-review-convergence.md Outdated
Comment thread get-shit-done/workflows/plan-review-convergence.md Outdated
Comment thread get-shit-done/workflows/plan-review-convergence.md Outdated
Comment thread get-shit-done/workflows/plan-review-convergence.md Outdated
@trek-e trek-e added size/XL area: workflow Phase execution, ROADMAP, STATE.md labels Apr 17, 2026
Copy link
Copy Markdown
Collaborator

@trek-e trek-e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial review.

New command and workflow for a cross-AI plan convergence loop. Overall well-structured. Key findings:

Issues addressed by author: All 4 CodeRabbit findings (grep -c fallback, init-before-parse ordering, argument-hint completeness, code block language identifiers) confirmed addressed in commit c401fd3.

No security issues found: No command injection vectors, no path traversal, no unsafe eval/exec patterns. Agent spawning uses Skill() calls, not raw shell.

Structural concerns:

  1. No test for the HIGH-count stall detection — the test suite covers loop structure, escalation gates, and REVIEWS.md verification, but there's no test verifying that prev_high_count == HIGH_COUNT triggers the stall warning. This is the most fragile logic path and the one most likely to silently regress.

  2. --ws forwarding to review agent — the workflow spawns Skill('gsd-review', ...) without threading GSD_WS through to it. If a user is working in a named workspace, the review agent may read from the wrong workspace. This was flagged by CodeRabbit (id 3097651140) and the author addressed init ordering but it's not clear --ws is forwarded. Worth confirming.

  3. Max-cycles=1 edge case — with --max-cycles 1, if the initial review finds HIGHs, the loop escalates immediately without a replan. The workflow text handles this but the test suite doesn't cover it.

@trek-e trek-e added the needs changes Review requested changes that must be addressed label Apr 17, 2026
Copy link
Copy Markdown
Collaborator

@trek-e trek-e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial review — REQUEST CHANGES.

Trek-e's prior review (2026-04-17T13:15:29Z) raised three concerns. The author addressed CodeRabbit's four findings in commit c401fd3, but trek-e's structural concerns were not addressed. The PR has no response from the author on those points and no subsequent commits that cover them.

Unresolved concerns:

1. --ws forwarding to review agent (correctness bug)
The convergence workflow spawns Skill('gsd-review', args='--phase {PHASE} {REVIEWER_FLAGS}') but does not thread GSD_WS through to the review agent. If the user is working in a named workspace (--ws project-name), the review agent reads from the wrong workspace context. The replan agent does pass {GSD_WS} — making the behavior asymmetric: planning is workspace-aware, reviewing is not. The review agent must also receive the workspace flag.

2. Missing stall detection test
The test suite has 7 suites covering command structure, init, planning gate, convergence loop, stall detection, escalation gate, and artifact verification. The stall detection suite tests that prev_high_count is tracked and that a warning is emitted — but no test verifies the condition HIGH_COUNT >= prev_high_count triggers the warning when the count does not decrease between cycles. The existing stall tests are structural (string presence) rather than behavioral. This is the most fragile path: a refactor of the stall variable names or comparison would break stall detection silently with no test catching it.

3. --max-cycles 1 edge case
With --max-cycles 1, a cycle 1 review that finds HIGHs immediately hits the escalation gate without a replan attempt. The workflow handles this correctly in prose, but the test suite has no coverage for this boundary. Users who set --max-cycles 1 as a dry-run/review-only mode will hit behavior that is untested.

Required fixes before approval:

  • Thread GSD_WS through to the review agent spawn call in plan-review-convergence.md
  • Add a behavioral stall detection test: simulate decreasing HIGH count (no stall warning) and stable HIGH count (stall warning emitted)
  • Add a --max-cycles 1 test covering the immediate escalation path

The spec, overall structure, and orchestrator-only design are correct. These are targeted gaps, not architectural problems.

@trek-e
Copy link
Copy Markdown
Collaborator

trek-e commented Apr 20, 2026

CodeRabbit finding audit — all four findings were valid and are confirmed addressed.

The most recent CodeRabbit review on this PR (run ID 2fa0485c) found no actionable comments, indicating commit c401fd3 addressed all four findings from the prior review:

  1. Argument-hint missing flags — VALID, addressed.
  2. Fenced code blocks without language tags (MD040) — VALID, addressed.
  3. HIGH_COUNT grep -c producing double "0" — VALID, addressed.
  4. INIT called before PHASE is parsed — VALID, addressed.

The trek-e adversarial review (CHANGES_REQUESTED, 2026-04-20) has three structural concerns that are separate from the CodeRabbit findings and remain open: (1) --ws not forwarded to review agent, (2) stall detection test is structural not behavioral, (3) --max-cycles 1 not tested. These are unrelated to CodeRabbit's findings but require author response before merge.

@caius-kong
Copy link
Copy Markdown
Author

Thanks for the thorough review @trek-e. All three concerns are addressed in commit 65c8c23:

1. --ws forwarding to review agent — fixed. Review agent spawn now passes {GSD_WS} to the gsd-review skill args, symmetric with replan agent. Added review agent spawn forwards --ws via {GSD_WS} test to catch regressions.

2. Behavioral stall detection tests — added. The suite now includes:

  • BEHAVIORAL: decreasing HIGH count (progress) does NOT trigger stall — 5→3, 10→1
  • BEHAVIORAL: stable HIGH count (no progress) DOES trigger stall — 5→5 (the classic revision-loop case)
  • BEHAVIORAL: increasing HIGH count (regression) DOES trigger stall — 5→7, 3→13
  • stall condition uses >= (covers both "stable" AND "increasing" cases) — guards against the > vs >= refactor risk you identified

3. --max-cycles 1 edge case — covered. Added:

  • BEHAVIORAL: --max-cycles 1 with HIGHs in cycle 1 immediately escalates (no replan)
  • BEHAVIORAL: --max-cycles 3 (default) only escalates after 3 cycles

Bonus fix (CYCLE_SUMMARY contract): In real-world validation on a phase with 3 review cycles, I discovered the raw-grep approach in cycle 2 triggered a false stall: the reviewer's cycle 2 output re-listed resolved HIGHs for audit, inflating the grep -c '**HIGH' count from 5 to 11 even though convergence was actually progressing (3 genuinely new HIGHs).

Fixed by mirroring gsd-plan-phase's pattern — instead of re-reading the artifact file, the orchestrator now parses a CYCLE_SUMMARY: current_high=<N> line from the review agent's return message. The contract explicitly excludes resolved HIGHs, historical references, and retrospective summary table mentions. Fail-loud if the contract is violated (no silent defaulting).

Test count: 27 → 39. All green locally (node --test tests/plan-review-convergence.test.cjs).

@caius-kong caius-kong force-pushed the feat/2306-feat-add-gsd-plan-review-convergence-com branch from 65c8c23 to 0cd8f81 Compare April 20, 2026 09:52
@trek-e
Copy link
Copy Markdown
Collaborator

trek-e commented Apr 21, 2026

@caius-kong — the three concerns from the previous review are addressed. The CYCLE_SUMMARY contract for extracting HIGH count from the agent's return message (rather than re-grepping REVIEWS.md) is the correct design choice — REVIEWS.md accumulates historical content across cycles so a raw grep would false-stall on resolved issues that are still mentioned for audit purposes.

This PR just needs a rebase to resolve the merge conflict before it can land. Once rebased on main, it should be ready to merge.

@caius-kong
Copy link
Copy Markdown
Author

The rebase is already done — commit 0cd8f81 was rebased on main before your approval landed. The PR should be conflict-free and ready to merge.

@trek-e
Copy link
Copy Markdown
Collaborator

trek-e commented Apr 23, 2026

it says right on the screen that there are conflicts, needs to be resolved.

caiuskong and others added 4 commits April 24, 2026 11:16
Cross-AI plan convergence loop that automates the plan→review→replan
cycle until no HIGH concerns remain or max cycles (default 3) reached.

- Orchestrator-only loop control: spawns isolated Agents for existing
  gsd-plan-phase and gsd-review Skills — no new planning/review logic
- Stall detection: warns immediately when HIGH count not decreasing
- Escalation gate: AskUserQuestion when max cycles reached with HIGHs
- TEXT_MODE fallback for plain-text escalation (Copilot compatibility)
- vscode_askquestions runtime note for VS Code Copilot

New files:
  commands/gsd/plan-review-convergence.md
  get-shit-done/workflows/plan-review-convergence.md
  tests/plan-review-convergence.test.cjs (27 tests, all passing)

Closes gsd-build#2306

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- argument-hint: add missing --claude, --opencode, --text, --ws flags
- workflow step ordering: parse ARGUMENTS before calling gsd-tools init
  so PHASE is populated when init plan-phase is invoked
- HIGH_COUNT: use || true + parameter expansion to avoid "0\n0" when
  grep -c finds no matches and exits nonzero
- MD040: add language tags to all fenced code blocks (bash/text/js)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gsd-build#2306)

Addresses the three structural concerns from trek-e's adversarial review
and also fixes a latent cycle-awareness bug in HIGH detection.

Changes:

1. Thread {GSD_WS} through to review agent spawn call (symmetric with replan).
   Previously review agent ran in default workspace when user passed --ws,
   while replan agent correctly received the flag. This was a real
   correctness bug for named workspaces.

2. Replace raw grep of REVIEWS.md with CYCLE_SUMMARY contract parsed from
   the review agent's return message. This mirrors gsd-plan-phase's pattern
   of parsing structured data from the checker's return (not re-reading
   the artifact file).

   Rationale: REVIEWS.md accumulates historical content across cycles.
   A cycle 2 review that lists resolved HIGHs for audit would inflate the
   raw grep count and falsely trigger stall detection, even though
   convergence was actually progressing. CYCLE_SUMMARY asks the reviewer
   to report ONLY current unresolved HIGHs (excluding resolved, historical
   refs, retrospective summary tables). Fail-loud if the contract is
   violated — no silent defaulting.

3. Strengthen test suite from 27 → 39 tests:
   - BEHAVIORAL stall detection tests: verify the >= condition correctly
     classifies decreasing/stable/increasing HIGH transitions (catches
     logic inversions or > vs >= regressions).
   - BEHAVIORAL --max-cycles 1 edge case: verify cycle=1/max=1/HIGHs>0
     immediately escalates without replan, and cycle=1/max=1/HIGHs=0
     converges via the HIGH_COUNT==0 branch.
   - New structural tests: step ordering (Parse ARGUMENTS before init),
     --ws forwarding to review agent, --ws forwarding to replan agent,
     CYCLE_SUMMARY regex presence, CYCLE_SUMMARY exclusion rules,
     CYCLE_SUMMARY contract fail-loud abort.

Closes gsd-build#2306

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR gsd-build#2595 landed a repo-wide slash-prefix rename while this PR was open.
The rebase took our workflow wholesale (preserving CYCLE_SUMMARY + 10
extra tests that Logan's gsd-build#2390 copy lacked), so the rename needs to be
reapplied to the 6 affected lines (spawn prompts + "next steps" text).

gsd-tools.cjs is a filesystem path, not a slash command — left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@caius-kong caius-kong force-pushed the feat/2306-feat-add-gsd-plan-review-convergence-com branch from 0cd8f81 to 4c6367e Compare April 24, 2026 03:18
@caius-kong
Copy link
Copy Markdown
Author

You're right — apologies, my prior reply was based on a stale mergeability snapshot. Re-investigated and found the actual picture:

Root cause of the recurring conflict:

Rebase now done (4c6367e):

  1. Rebased on upstream/main using -X theirs to take our branch's version for the add/add conflicts — this preserves the CYCLE_SUMMARY contract, behavioral stall tests, and --max-cycles 1 boundary coverage that PR docs: authoritative shipped-surface inventory with filesystem-backed parity tests #2390's copy of our files does not have.
  2. Applied the /gsd-<cmd>/gsd:<cmd> rename from PR fix(#2543): replace legacy /gsd-<cmd> syntax with /gsd:<cmd> across source files #2595 to the 6 affected lines (spawn prompts + "next steps" text). gsd-tools.cjs left alone since it's a filesystem path, not a slash command.
  3. Tests: 39/39 green locally (node --test tests/plan-review-convergence.test.cjs).

Status now: mergeable: MERGEABLE. Net value over the #2390 snapshot: CYCLE_SUMMARY cycle-awareness + 10 additional tests.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
get-shit-done/workflows/plan-review-convergence.md (1)

274-274: ⚠️ Potential issue | 🟡 Minor

Update success criteria to reflect CYCLE_SUMMARY parsing.

The success criteria states "grep HIGHs" but the workflow now extracts HIGH counts from the review agent's CYCLE_SUMMARY contract (lines 157-166), not by grepping REVIEWS.md. This inconsistency could confuse implementers.

📋 Suggested fix
-- [ ] Orchestrator only does: init, loop control, grep HIGHs, stall detection, escalation
+- [ ] Orchestrator only does: init, loop control, parse CYCLE_SUMMARY for HIGH count, stall detection, escalation
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@get-shit-done/workflows/plan-review-convergence.md` at line 274, The success
criteria line stating "grep HIGHs" is outdated; update the Orchestrator success
criteria to reflect that it should parse the review agent's CYCLE_SUMMARY
contract (see CYCLE_SUMMARY parsing logic referenced around lines 157-166) to
extract HIGH counts instead of grepping REVIEWS.md, and clarify that the
Orchestrator only handles init, loop control, stall detection, extraction of
HIGH counts from CYCLE_SUMMARY, and escalation.
🧹 Nitpick comments (3)
get-shit-done/workflows/plan-review-convergence.md (3)

163-164: Consider validating the "Current HIGH Concerns" section format.

The contract requires both CYCLE_SUMMARY (line 132) and the "## Current HIGH Concerns" section (line 141), but the abort behavior (line 163) only validates CYCLE_SUMMARY. An agent could include a valid CYCLE_SUMMARY but omit or misformat the concerns section, leading to empty HIGH_LINES during escalation display (lines 208, 222).

While this won't break control flow (HIGH_COUNT drives the loop, not HIGH_LINES), it degrades the user experience at escalation by showing no concern details.

🛡️ Suggested validation enhancement
 - `HIGH_COUNT`: the integer matched by regex `CYCLE_SUMMARY:\s+current_high=(\d+)`. If no match, abort with: `Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH count. Retry or switch reviewer.`
 - `HIGH_LINES`: the bullets under the `## Current HIGH Concerns` section of the return message (up to the next `##` heading or end of message). If the section contains only `None.`, set `HIGH_LINES=""`.
+
+If `HIGH_COUNT > 0` and `HIGH_LINES` is empty (section missing or malformed), warn: `Review agent's CYCLE_SUMMARY reports ${HIGH_COUNT} HIGHs but did not provide ## Current HIGH Concerns section — contract partially violated. Continuing with incomplete escalation details.`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@get-shit-done/workflows/plan-review-convergence.md` around lines 163 - 164,
Add a validation step after extracting CYCLE_SUMMARY/HIGH_COUNT that also
verifies the "## Current HIGH Concerns" section was present and parsed into
HIGH_LINES (or explicitly contains only "None."); if the section is missing or
malformed (i.e., HIGH_LINES undefined or not a string/empty when not "None."),
abort with a clear error similar to the CYCLE_SUMMARY check (e.g., "Review agent
did not honor the CYCLE_SUMMARY contract — cannot determine HIGH concerns. Retry
or switch reviewer.") so the workflow doesn't proceed with absent/misformatted
HIGH_LINES; reference the CYCLE_SUMMARY, HIGH_COUNT and HIGH_LINES
extraction/parsing logic and the "## Current HIGH Concerns" section when
implementing this check.

134-136: Clarify "PARTIALLY RESOLVED" definition to ensure consistent counting.

The INCLUDE list mentions "PARTIALLY RESOLVED HIGHs" but doesn't define what distinguishes "partially resolved" from "fully resolved." Different review agents or cycles might interpret this differently, leading to inconsistent HIGH counts that could trigger false stall detection.

Consider adding a brief definition or criteria, such as:

  • PARTIALLY RESOLVED: concern acknowledged with mitigation in progress but not yet verified/completed
  • FULLY RESOLVED: concern addressed with verification complete
📝 Suggested clarification
 Where <N> counts HIGH-severity concerns that REMAIN UNRESOLVED in this cycle's findings:
-  - INCLUDE: newly raised HIGHs; PARTIALLY RESOLVED HIGHs; previously raised and still unresolved HIGHs
+  - INCLUDE: newly raised HIGHs; PARTIALLY RESOLVED HIGHs (acknowledged but mitigation incomplete); previously raised and still unresolved HIGHs
   - EXCLUDE: explicitly RESOLVED/FULLY RESOLVED HIGHs; HIGH mentions in retrospective/summary tables comparing cycles; quoted excerpts from prior reviews
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@get-shit-done/workflows/plan-review-convergence.md` around lines 134 - 136,
Add clear, short definitions for the terms "PARTIALLY RESOLVED" and "FULLY
RESOLVED" next to the INCLUDE/EXCLUDE guidance so reviewers count HIGHs
consistently; specify that PARTIALLY RESOLVED means the HIGH is acknowledged and
a mitigation or fix is in progress but not yet verified/completed (include
expected evidence like an open ticket or remediation plan), and that FULLY
RESOLVED means the issue is addressed and verification/closure steps are
complete (include evidence such as closed ticket, verification log, or reviewer
sign-off). Reference the existing INCLUDE list wording ("PARTIALLY RESOLVED
HIGHs" and "explicitly RESOLVED/FULLY RESOLVED HIGHs") and insert the two
one-line criteria immediately after that list.

163-163: Consider adding validation for malformed CYCLE_SUMMARY values.

The current abort message "did not honor the CYCLE_SUMMARY contract" covers both missing and malformed cases, but distinguishing between them could help with debugging. For example, if the agent writes CYCLE_SUMMARY: current_high=unknown, the error could indicate format problems rather than complete absence.

🔍 Enhanced error reporting
-If no match, abort with: `Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH count. Retry or switch reviewer.`
+If no match:
+  - Check if a CYCLE_SUMMARY line exists (even if malformed): hint at format issues
+  - Otherwise: report completely missing contract
+Abort with: `Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH count. Retry or switch reviewer.`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@get-shit-done/workflows/plan-review-convergence.md` at line 163, The parsing
of HIGH_COUNT using the regex `CYCLE_SUMMARY:\s+current_high=(\d+)` should
distinguish between a missing CYCLE_SUMMARY and a malformed value; update the
code that currently aborts with "Review agent did not honor the CYCLE_SUMMARY
contract — cannot determine HIGH count. Retry or switch reviewer." to first
check for a missing match and keep the existing abort message, and if there is a
match but the captured value does not parse to an integer (or fails validation),
abort with a different, explicit message like "CYCLE_SUMMARY present but
current_high is malformed — expected integer, got '<value>'", referencing the
`HIGH_COUNT` extraction logic so callers can debug formatting vs absence.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@get-shit-done/workflows/plan-review-convergence.md`:
- Around line 157-166: The workflow lacks a concrete parser for the agent's
return message; implement extraction logic that reads the agent output
(AGENT_RETURN), captures HIGH_COUNT by matching the CYCLE_SUMMARY line (regex
for current_high=\d+), validates it is a non-empty integer and aborts with the
exact message "Review agent did not honor the CYCLE_SUMMARY contract — cannot
determine HIGH count. Retry or switch reviewer." if missing or invalid, and
extracts HIGH_LINES as the bullet lines under the "## Current HIGH Concerns"
section (stop at the next "##" or end of message), treating a section that
contains only "None." as HIGH_LINES=""; ensure robustness to minor formatting by
trimming whitespace and handling missing sections consistently.

---

Outside diff comments:
In `@get-shit-done/workflows/plan-review-convergence.md`:
- Line 274: The success criteria line stating "grep HIGHs" is outdated; update
the Orchestrator success criteria to reflect that it should parse the review
agent's CYCLE_SUMMARY contract (see CYCLE_SUMMARY parsing logic referenced
around lines 157-166) to extract HIGH counts instead of grepping REVIEWS.md, and
clarify that the Orchestrator only handles init, loop control, stall detection,
extraction of HIGH counts from CYCLE_SUMMARY, and escalation.

---

Nitpick comments:
In `@get-shit-done/workflows/plan-review-convergence.md`:
- Around line 163-164: Add a validation step after extracting
CYCLE_SUMMARY/HIGH_COUNT that also verifies the "## Current HIGH Concerns"
section was present and parsed into HIGH_LINES (or explicitly contains only
"None."); if the section is missing or malformed (i.e., HIGH_LINES undefined or
not a string/empty when not "None."), abort with a clear error similar to the
CYCLE_SUMMARY check (e.g., "Review agent did not honor the CYCLE_SUMMARY
contract — cannot determine HIGH concerns. Retry or switch reviewer.") so the
workflow doesn't proceed with absent/misformatted HIGH_LINES; reference the
CYCLE_SUMMARY, HIGH_COUNT and HIGH_LINES extraction/parsing logic and the "##
Current HIGH Concerns" section when implementing this check.
- Around line 134-136: Add clear, short definitions for the terms "PARTIALLY
RESOLVED" and "FULLY RESOLVED" next to the INCLUDE/EXCLUDE guidance so reviewers
count HIGHs consistently; specify that PARTIALLY RESOLVED means the HIGH is
acknowledged and a mitigation or fix is in progress but not yet
verified/completed (include expected evidence like an open ticket or remediation
plan), and that FULLY RESOLVED means the issue is addressed and
verification/closure steps are complete (include evidence such as closed ticket,
verification log, or reviewer sign-off). Reference the existing INCLUDE list
wording ("PARTIALLY RESOLVED HIGHs" and "explicitly RESOLVED/FULLY RESOLVED
HIGHs") and insert the two one-line criteria immediately after that list.
- Line 163: The parsing of HIGH_COUNT using the regex
`CYCLE_SUMMARY:\s+current_high=(\d+)` should distinguish between a missing
CYCLE_SUMMARY and a malformed value; update the code that currently aborts with
"Review agent did not honor the CYCLE_SUMMARY contract — cannot determine HIGH
count. Retry or switch reviewer." to first check for a missing match and keep
the existing abort message, and if there is a match but the captured value does
not parse to an integer (or fails validation), abort with a different, explicit
message like "CYCLE_SUMMARY present but current_high is malformed — expected
integer, got '<value>'", referencing the `HIGH_COUNT` extraction logic so
callers can debug formatting vs absence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d373578b-7291-41dd-aa81-42549ce38d2b

📥 Commits

Reviewing files that changed from the base of the PR and between 0cd8f81 and 4c6367e.

📒 Files selected for processing (1)
  • get-shit-done/workflows/plan-review-convergence.md

Comment thread get-shit-done/workflows/plan-review-convergence.md
…ria (gsd-build#2306)

Addresses CodeRabbit review on 4c6367e:

1. Success criteria in <success_criteria> still said "grep HIGHs" after
   we migrated to the CYCLE_SUMMARY contract. Updated to
   "parse CYCLE_SUMMARY for HIGH count" so implementers are not misled
   by the obsolete description.

2. CodeRabbit (Major) asked for concrete bash parsing code for the
   CYCLE_SUMMARY extraction. Clarified in a new "Parsing layer:" note
   that this extraction is performed by the orchestrator LLM on the
   Agent's return message, not a shell pipeline — there is no
   $AGENT_RETURN variable for grep/sed to consume, the return message
   is in-context for the orchestrator. This is the same data-flow shape
   gsd-plan-phase uses with gsd-plan-checker's return. Expressing it as
   shell code would imply a runtime that does not exist here.

Tests: 39/39 still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@caius-kong
Copy link
Copy Markdown
Author

caius-kong commented Apr 24, 2026

Thanks @coderabbitai. Both findings addressed on the current branch tip:

Finding 1 (success criteria says "grep HIGHs") — valid, fixed. Updated <success_criteria> to parse CYCLE_SUMMARY for HIGH count, matching the workflow body.

Finding 2 (concrete implementation for CYCLE_SUMMARY parsing) — not taking the bash suggestion. The extraction targets the Agent's return message text (in-context for the orchestrator LLM), not a file or shell variable. There is no $AGENT_RETURN for grep -oP/sed to consume — the orchestrator reads the return message directly from the Agent(...) call's result. The surrounding text already says "Parse the agent's final return message" and the extraction rules are expressed as regex/section references (not shell syntax), so adding shell code would imply a runtime stage that doesn't exist here and contradict the data-flow (the same shape gsd-plan-phase uses when parsing gsd-plan-checker's return).

Also a secondary concern with the suggested snippet: grep -oP ... \K is GNU-only (PCRE + variable-length look-behind); macOS BSD grep would fail silently here.

Tests: 39/39 still green.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.

…parsing (gsd-build#2306)

Added a defensive "Parsing layer:" paragraph in e38f66d to tell readers
not to look for a $AGENT_RETURN bash variable. On review, this note
was addressing a concern the surrounding text does not actually create:
"Parse the agent's final return message" and the regex-based extraction
rules already make clear this is not a shell pipeline, and no other
section mentions $AGENT_RETURN. The note was appeasement for the bot
review, not clarification for a human reader.

Keeping the success criteria sync from e38f66d — that one was a real
inconsistency (obsolete "grep HIGHs" after we moved to CYCLE_SUMMARY).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.

@caius-kong
Copy link
Copy Markdown
Author

Summary of this round — PR is now ready:

  • Rebased on upstream/main (d6b1a94). Merge conflicts resolved.
  • File count dropped 4 → 2 in the PR diff. Not a regression: PR docs: authoritative shipped-surface inventory with filesystem-backed parity tests #2390 independently landed byte-identical copies of commands/gsd/plan-review-convergence.md and the CHANGELOG entry while this PR was open, so those files now "cancel out" against main. The two remaining files (workflows/plan-review-convergence.md, tests/plan-review-convergence.test.cjs) carry the real net value-add over #2390's snapshot: CYCLE_SUMMARY contract + 10 behavioral tests.
  • CodeRabbit's latest findings addressed: success-criteria sync fixed; the bash implementation request declined with reasoning (comment above).

Status: mergeable: MERGEABLE, tests 39/39 green.

Copy link
Copy Markdown
Collaborator

@trek-e trek-e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial review — requesting changes

This is a cross-AI plan convergence loop with replanning. I'm evaluating the iteration on top of #2306 as shipped, plus what this PR actually changes (160/72 touching workflow + tests; CYCLE_SUMMARY contract, --ws symmetry, behavioral tests, slash-prefix rename).

Gall's Law — was this evolved from something simpler?

Yes, and visibly. Commit history shows this started as grep-of-REVIEWS.md, then trek-e's review + a latent cycle-awareness bug drove the move to a CYCLE_SUMMARY return contract. That is a legitimate evolutionary step from a working simpler system; credit where due. But the system is now simultaneously (a) a shell orchestrator writing $HIGH_COUNT-style pseudocode and (b) an LLM-level parsing layer — and the PR dropped the "Parsing layer:" note in d6b1a94 on the grounds that "surrounding text already implies LLM parsing." It does not imply that for anyone reading HIGH_COUNT=...regex match... at get-shit-done/workflows/plan-review-convergence.md:162-164, which is written as if something executes it. The note was not appeasement; the note was the one line telling a future implementer not to try to wire bash around this. Put it back.

Goodhart's Law — the termination metric is the vulnerability

"Loop until HIGH_COUNT == 0 or MAX_CYCLES" with HIGH_COUNT reported by the reviewer itself creates a target-becomes-measure: the reviewer learns (or is coached by the replanner) that downgrading HIGHs to MEDIUMs ends the loop. The CYCLE_SUMMARY contract at plan-review-convergence.md:125-142 is an honor-system instruction to an external AI — Codex/Gemini/Copilot — that has no skin in distinguishing HIGH from MEDIUM accurately. The "EXCLUDE: explicitly RESOLVED HIGHs" rule at :134 is especially exploitable: a replan that renames a concern rather than fixing it satisfies the contract. There is no guard against severity drift across cycles (e.g., tracking that no previously-HIGH became MEDIUM without a RESOLVED marker). Add one, or acknowledge this in the workflow as a known failure mode rather than shipping the metric as reliable.

Peter Principle — unbounded adversarial reviewer

If the external reviewer keeps inventing new HIGHs each cycle (not the "stall" case — the "invents different HIGHs each time" case), prev_high_count comparison at :170 only catches non-decreasing counts. 5 fresh HIGHs → 5 different fresh HIGHs reads as stall; 5 → 4 → 3 cascading churn of different concerns reads as progress and exhausts all 3 cycles. There is no "churn detection" (HIGH identity tracking across cycles). MAX_CYCLES caps wall time but not reviewer adversariality. Acceptable for this PR if called out — currently it is not.

Kernighan — state machine

Control flow is fine: 5a → 5b → 5c → 5d → 5a is linear. No clever async. prev_high_count is initialized before the loop (confirmed :170 reference), so the first-cycle stall-false-positive case is handled. This part is boring in the good way.

Knuth — cost per cycle unmeasured

3 cycles × (1 plan-phase + 1 cross-AI review + 1 replan) = ~9 LLM-agent invocations worst case, each themselves spawning sub-skills. No measurement, no budget, no --max-tokens or --max-cost escape hatch. For an XL feature whose name is "convergence loop" this is the first question an operator will ask. At minimum, log per-cycle timing so users can see what they paid.

Minor

  • plan-review-convergence.md:249 — success_criteria still says "grep HIGHs". e38f66d was supposed to sync this. Verify: the diff shows it was updated to "parse CYCLE_SUMMARY for HIGH count", but the file at HEAD above still reads "grep HIGHs" at line 249. Check this is actually committed.
  • CYCLE_SUMMARY contract failure mode at :144 is fail-loud ("abort"). Good. But there is no retry with a different reviewer — one malformed return ends the whole loop. Consider one retry before abort.
  • tests/plan-review-convergence.test.cjs — 39 tests are all content assertions on workflow strings. These verify the prompt text exists, not that the LLM actually honors it. That is the limit of the pattern, but claiming "BEHAVIORAL" in test names (e.g. BEHAVIORAL: decreasing HIGH count...) is overclaim: those tests simulate a JS function stallCondition, not the workflow. Rename to SIMULATED: or drop the caps.

Verdict

Changes requested. Core loop evolution is sound; the CYCLE_SUMMARY migration is a real improvement over raw grep. Blockers: (1) success_criteria line 249 still says grep, (2) no churn detection / severity-drift guard against the Goodhart attack the loop invites, (3) no cost observability for a loop that costs N×3 LLM calls, (4) restore the dropped Parsing-layer note. Address these, then this ships.

@trek-e trek-e added the feature-request New feature proposal label Apr 24, 2026
@trek-e trek-e added enhancement New feature or request area: commands Slash commands review: changes requested PR reviewed — changes required before merge priority: medium Quality-of-life, affects some users labels Apr 24, 2026
@trek-e trek-e closed this in b401101 Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: commands Slash commands area: workflow Phase execution, ROADMAP, STATE.md enhancement New feature or request feature-request New feature proposal needs changes Review requested changes that must be addressed priority: medium Quality-of-life, affects some users review: changes requested PR reviewed — changes required before merge size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add gsd:plan-review-convergence command — cross-AI plan convergence loop

3 participants